Introduction

This project is an exploratory analysis modeled on a real-world NBA data science workflow. The goal was to engage with the data, explore patterns, and practice applying data science techniques in a realistic setting. Throughout the analysis, I work through the dataset step by step, letting insights and questions emerge naturally. The focus is on clear, concise code and effective visualizations, with an emphasis on producing work that mirrors how data science problems are approached in practice.

Setup and Data

library(tidyverse)

schedule <- read_csv("schedule.csv")
draft_schedule <- read_csv("schedule_24_partial.csv")
locations <- read_csv("locations.csv")
game_data <- read_csv("team_game_data.csv")

Schedule Analysis

In this section, I analyze NBA scheduling data to understand how game density and rest patterns affect teams, focusing on workload rather than on-court talent. I move step by step from defining a clear fatigue metric, to placing teams in historical context, to interpreting whether observed differences are meaningful or just noise.

I begin by defining a concrete measure of schedule density: 4 games in 6 nights. For the Thunder’s 2024–25 draft schedule, I explicitly identify each game that represents the fourth game played within a rolling six-day window. I allow these windows to overlap, since the goal is to capture how often players are exposed to peak workload conditions rather than to count distinct stretches. This gives me a precise count of how frequently OKC faces high-density situations in the upcoming season.

Next, I broaden the scope to the entire league and look historically from 2014–15 through 2023–24. I calculate how many 4-in-6 situations each team experiences per season and then normalize those counts to a standard 82-game season. This lets me establish a league-wide baseline for what a “typical” workload looks like and determine whether OKC’s draft schedule is heavier or lighter than average.
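The per-82 normalization is a simple proportional rescaling. As a toy check (the 72-game season length here is illustrative, e.g. a shortened season, not a figure from the dataset):

```r
# A team with 22 four-in-six games over a 72-game season scales to
# roughly the league-average rate over a full 82-game schedule.
22 * (82 / 72)
## [1] 25.05556
```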

After establishing the average, I compare teams against one another to identify which franchises consistently experience the most and least schedule density. By averaging across seasons, I find the teams with the highest and lowest long-run exposure to 4-in-6 stretches. This comparison helps distinguish whether differences are driven by structural scheduling patterns rather than by one-off seasons.

I then assess whether the gap between the most and least affected teams is actually meaningful. To do this, I run a permutation test that simulates how large the difference would be if schedules were effectively random across teams. By comparing the observed gap to this null distribution, I determine that the difference is not particularly surprising and is likely consistent with random variation rather than systematic bias.

Finally, I shift from schedule structure to in-game performance under fatigue conditions. Using Brooklyn as a case study, I calculate their defensive effective field goal percentage during the 2023–24 season and compare it to their defensive performance when opponents are playing on the second night of a back-to-back. This provides a concrete example of how schedule-related fatigue can translate into measurable changes in on-court outcomes.
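The defensive eFG% calculation later in this section uses the standard formula eFG% = (FGM + 0.5 × 3PM) / FGA, which credits made threes an extra half make for the extra point. A quick toy example with invented box-score totals (not Brooklyn's actual numbers):

```r
# Made-up totals: 40 field goals made, 12 of them threes, on 85 attempts.
fgm <- 40; fg3m <- 12; fga <- 85
100 * (fgm + 0.5 * fg3m) / fga
## [1] 54.11765
```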

four_in_six = function(dates) {
  dates = as.Date(dates)
  # For each game date d, count the games in the six-night window [d - 5, d]
  # and flag d when it is the fourth game in that window; windows may overlap.
  vapply(dates, function(d) sum(dates >= (d - 5) & dates <= d) == 4L, logical(1))
}
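
A quick sanity check of the function on a handful of made-up dates: only the January 6 game is the fourth game inside a six-night window.

```r
# Games on Jan 1, 2, 4, 6, and 10: the Jan 6 game closes a window
# (Jan 1 through Jan 6) containing four games; no other game does.
toy_dates <- c("2024-01-01", "2024-01-02", "2024-01-04",
               "2024-01-06", "2024-01-10")
four_in_six(toy_dates)
## [1] FALSE FALSE FALSE  TRUE FALSE
```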

okc_4in6 = draft_schedule %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  filter(team == "OKC") %>%
  arrange(gamedate) %>%
  mutate(is_4in6 = four_in_six(gamedate))

sum(okc_4in6$is_4in6)
## [1] 26

26 4-in-6 stretches in OKC’s draft schedule.

team_rows = function(df) {
  # Long format: one row per (team, game). Wide schedules with home_team /
  # away_team columns are split into two rows; long schedules pass through.
  if (all(c("home_team", "away_team") %in% names(df))) {
    home = df %>% transmute(season, gamedate = as.Date(gamedate), team = home_team)
    away = df %>% transmute(season, gamedate = as.Date(gamedate), team = away_team)
    bind_rows(home, away)
  } else {
    df %>% transmute(season, gamedate = as.Date(gamedate), team)
  }
}

team_sched = team_rows(schedule) %>%
  filter(season >= 2014, season <= 2023) %>%   
  arrange(team, season, gamedate)

per_team_season = team_sched %>%
  group_by(team, season) %>%
  arrange(gamedate, .by_group = TRUE) %>%
  mutate(is_4in6 = four_in_six(gamedate)) %>%
  summarise(games = n(), four_in_six = sum(is_4in6), .groups = "drop") %>%
  mutate(four_in_six_per82 = four_in_six * (82 / games))

avg_per82_4in6 = mean(per_team_season$four_in_six_per82, na.rm = TRUE)
avg_per82_4in6
## [1] 25.09998

25.1 4-in-6 stretches on average.

team_avgs = per_team_season %>%
  group_by(team) %>%
  summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE), .groups = "drop")

most_4in6 = team_avgs %>% slice_max(avg_4in6_per82, n = 1)

fewest_4in6 = team_avgs %>% slice_min(avg_4in6_per82, n = 1)

most_4in6
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 CHA             28.1
fewest_4in6
## # A tibble: 1 × 2
##   team  avg_4in6_per82
##   <chr>          <dbl>
## 1 NYK             22.2
  • Most 4-in-6 stretches on average: Charlotte Hornets (28.1)
  • Fewest 4-in-6 stretches on average: New York Knicks (22.2)
highlight_teams = c("CHA", "NYK")

plot_df = team_avgs %>%
  mutate(highlight = case_when(
    team == "OKC" ~ "OKC",
    team %in% highlight_teams ~ "Highlighted",
    TRUE ~ "Other"
  ))

ggplot(plot_df, aes(x = reorder(team, avg_4in6_per82), y = avg_4in6_per82, fill = highlight)) +
  geom_col() +
  coord_flip() +
  geom_text(aes(label = sprintf("%.1f", avg_4in6_per82)), hjust = -0.1, size = 3) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.10))) +
  scale_fill_manual(values = c("Other" = "steelblue","Highlighted" = "burlywood", 
                               "OKC" = "darkorange"), guide = "none") +
  labs(
    title = "Per-82 Average 4-in-6 Stretches by Team (2014–15 to 2023–24)",
    subtitle = "Charlotte Hornets, New York Knicks highlighted; OKC in orange",
    x = "Teams", y = "Average 4-in-6 per 82 games") + theme_minimal(base_size = 10)

set.seed(42)

team_avg_diff = diff(range(team_avgs$avg_4in6_per82, na.rm = TRUE))

perm = function(df) {
  df %>%
    group_by(season) %>%
    mutate(team = sample(team)) %>%   
    ungroup() %>%
    group_by(team) %>%
    summarise(avg_4in6_per82 = mean(four_in_six_per82, na.rm = TRUE), .groups = "drop") %>%
    summarise(gap = max(avg_4in6_per82) - min(avg_4in6_per82)) %>%
    pull(gap)
}

B = 5000
null_diff = replicate(B, perm(per_team_season))


p_val = mean(null_diff >= team_avg_diff)

q95 = quantile(null_diff, 0.95)

list(observed_gap = team_avg_diff, p_value = p_val, null_95th_percentile = q95)
## $observed_gap
## [1] 5.923077
## 
## $p_value
## [1] 0.0664
## 
## $null_95th_percentile
##      95% 
## 6.044829
  • The permutation test returns a p-value of 0.0664, meaning a gap this large would occur by chance roughly 6–7 times in 100 random schedules. By the conventional threshold (p ≤ 0.05), the observed difference is not statistically surprising, so it is most likely the result of chance rather than systematic scheduling bias.

gd = game_data %>%
  mutate(gamedate = as.Date(gamedate), season = as.integer(season)) %>%
  group_by(season, off_team) %>%
  arrange(gamedate, .by_group = TRUE) %>%
  # Flag games played the night after the previous one (second night of a
  # back-to-back); the first game of each team-season is NA and drops out later.
  mutate(is_second_b2b = as.integer(gamedate - lag(gamedate) == 1)) %>%
  ungroup()

opp_vs_bkn_2023 = gd %>% filter(season == 2023, def_team %in% c("BKN","BRK"), fgattempted > 0)

bkn_def_efg = 100 * with(opp_vs_bkn_2023,
  sum(fgmade + 0.5 * fg3made, na.rm = TRUE) / sum(fgattempted, na.rm = TRUE))

opp_b2b = opp_vs_bkn_2023 %>% filter(is_second_b2b == 1)

bkn_def_efg_b2b = 100 * with(opp_b2b,
  sum(fgmade + 0.5 * fg3made, na.rm = TRUE) / sum(fgattempted, na.rm = TRUE))

sprintf("BKN Defensive eFG%%: %.1f%%", bkn_def_efg)
## [1] "BKN Defensive eFG%: 54.3%"
sprintf("When opponent on a B2B (second night): %.1f%%", bkn_def_efg_b2b)
## [1] "When opponent on a B2B (second night): 53.5%"
  • BKN Defensive eFG%: 54.3%
  • When opponent on a B2B: 53.5%

Modeling

In this section, I build a simple model to estimate how much each team’s schedule has helped or hurt its regular-season win totals from the 2019–20 through 2023–24 seasons. The goal is not to predict wins, but to isolate the portion of win totals that can reasonably be attributed to schedule structure rather than team quality.

I start by standardizing the schedule data across seasons, ensuring consistent game dates, team identifiers, and home/away flags. This allows me to reliably track how many games each team plays at home versus on the road over the full multi-season window. I restrict the analysis to games from 2019–20 onward to focus on a modern scheduling context.

To translate schedule structure into wins, I estimate a league-average home-court advantage directly from the data. I calculate the difference between home win percentage and away win percentage across all games in the sample, and cap this estimate within a reasonable range to avoid extreme values driven by noise. This produces a conservative, data-driven estimate of how much playing at home is worth in terms of win probability.
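The capping step keeps the estimated edge inside a plausible band. A small illustration on hypothetical raw edges (the three values are invented for illustration; only the [0.08, 0.20] bounds come from the code below):

```r
# Raw home-minus-away win percentage gaps outside [0.08, 0.20]
# are clamped to the nearer bound; values inside pass through.
raw_edges <- c(0.05, 0.14, 0.31)
pmin(pmax(raw_edges, 0.08), 0.20)
## [1] 0.08 0.14 0.20
```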

Using this estimate, I then compute each team’s home-game imbalance, which is how much their share of home games deviates from a perfectly balanced 50/50 split. I multiply that imbalance by the total number of games played and by the estimated home-court edge to convert schedule structure into an estimated number of wins gained or lost purely due to scheduling.
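As a worked toy example of that conversion (a hypothetical team and the 0.14 fallback edge, not a result from the data):

```r
# A team with 43 of its 82 games at home is 2 home games above a 50/50
# split; at a 0.14 home edge, that is worth about a quarter of a win.
home_share <- 43 / 82
(home_share - 0.5) * 82 * 0.14
## [1] 0.28
```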

I aggregate these effects at the team level across all seasons in the sample, producing a single number per team that represents how much the schedule has helped or hurt them overall. I then visualize these estimates in a bar chart centered at zero, making it easy to compare teams that benefited from schedule imbalance to those that were disadvantaged.

Finally, I identify the teams most helped and most hurt by schedule effects. The results show that while the magnitude of these effects is modest, they are still meaningful at the margins, especially in a league where playoff seeding and tiebreakers often come down to one or two games.

Overall, this modeling approach provides a rough but interpretable estimate of schedule-driven win impact. It intentionally prioritizes transparency and realism over complexity, and the results should be interpreted as directional rather than exact, capturing how schedule structure alone can influence outcomes over time.

schedule = schedule %>% 
  mutate(gamedate = as.Date(if ("gamedate" %in% names(schedule)) gamedate
                            else if ("game_date" %in% names(schedule)) game_date
                            else date))

home_away = {
  s = schedule
  has = function(x) x %in% names(s)
  if (has("home")) {
    s = s %>% 
      mutate(home = case_when(
        # coerce to logical so every case_when right-hand side has the same type
        is.logical(home) ~ as.logical(home),
        tolower(as.character(home)) %in% c("1","t","true","yes","y","h","home") ~ TRUE,
        tolower(as.character(home)) %in% c("0","f","false","no","n","a","away") ~ FALSE,
        TRUE ~ NA
      ))
  } else if (has("home_team") && has("team")) {
    s = s %>%  mutate(home = team == .data$home_team)
  } else if (has("homeaway")) {
    s = s %>%  mutate(home = tolower(as.character(homeaway)) %in% c("home","h"))
  } else if (has("location")) {
    s = s %>%  mutate(home = tolower(as.character(location)) %in% c("home","h"))
  } else {
    s = s %>%  mutate(home = NA)
  }
  s %>%  transmute(team, gamedate, home = as.logical(home))
}

if (!exists("gd_feat")) {
  gd_feat = schedule %>% 
    select(team, opponent, gamedate) %>% 
    left_join(home_away, by = c("team","gamedate")) %>% 
    mutate(rest_days = NA_real_, dist_km = NA_real_)
}


ha_19_24 = home_away %>% 
  filter(year(gamedate) >= 2019, year(gamedate) <= 2024)

win_df = NULL
if (exists("gd_feat") && all(c("team","gamedate","win") %in% names(gd_feat))) {
  win_df = gd_feat %>% 
    transmute(team, gamedate = as.Date(gamedate), win = as.numeric(as.logical(win)))
} else if ("win" %in% names(schedule)) {
  # schedule$gamedate was already normalized above, so it can be used directly
  win_df = schedule %>% 
    transmute(team, gamedate = as.Date(gamedate), win = as.numeric(as.logical(win)))
}

ha_out = ha_19_24 %>% 
  left_join(win_df, by = c("team","gamedate"))

estimate_edge = function(df) {
  if (!"win" %in% names(df) || all(is.na(df$win))) return(0.14)          
  df2 = df %>%  filter(!is.na(home), !is.na(win))
  if (nrow(df2) < 100) return(0.14)
  home_wp = mean(df2$win[df2$home %in% TRUE],  na.rm = TRUE)
  away_wp = mean(df2$win[df2$home %in% FALSE], na.rm = TRUE)
  edge = home_wp - away_wp
  pmin(pmax(edge, 0.08), 0.20)
}

home_edge = estimate_edge(ha_out)

team_totals = ha_19_24 %>% 
  group_by(team) %>% 
  summarise(
    games = n(),
    home_share = mean(home, na.rm = TRUE),
    .groups = "drop"
  ) %>% 
  mutate(schedule_wins = (home_share - 0.5) * games * home_edge) %>% 
  arrange(desc(schedule_wins))

most_helped = team_totals %>%  slice_max(schedule_wins, n = 1)
most_hurt = team_totals %>%  slice_min(schedule_wins, n = 1)

team_totals %>% 
  mutate(team = forcats::fct_reorder(team, schedule_wins)) %>% 
  ggplot(aes(team, schedule_wins)) +
  geom_col() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.6) +
  scale_y_continuous(labels = function(x) sprintf("%+.1f", x)) +
  labs(
    title = "Estimated Wins Gained/Lost from Home-Share Imbalance (2019–2024)",
    x = NULL, y = "Wins due to home-share imbalance"
  ) + theme_minimal(base_size = 12)

  • Most Helped by Schedule: Cleveland Cavaliers (+0.6 wins)
  • Most Hurt by Schedule: Orlando Magic (-0.6 wins)